Star Hotels

Context

A significant number of hotel bookings are called off due to cancellations or no-shows. Typical reasons for cancellations include changes of plans and scheduling conflicts. Cancelling is often made easier by the option to do so free of charge or at a low cost. This benefits hotel guests, but it is less desirable for hotels and can reduce their revenue. Such losses are particularly high for last-minute cancellations.

The new technologies involving online booking channels have dramatically changed customers’ booking possibilities and behavior. This adds a further dimension to the challenge of how hotels handle cancellations, which are no longer limited to traditional booking and guest characteristics.

The cancellation of bookings impacts a hotel on various fronts:

  1. Loss of resources (revenue) when the hotel cannot resell the room.
  2. Additional costs of distribution channels by increasing commissions or paying for publicity to help sell these rooms.
  3. Lowering prices at the last minute so the hotel can resell a room, which reduces the profit margin.
  4. Human resources to make arrangements for the guests.

Objective

The increasing number of cancellations calls for a Machine Learning based solution that can help predict which bookings are likely to be canceled. Star Hotels Group has a chain of hotels in Portugal. They are facing problems with the high number of booking cancellations and have reached out to your firm for data-driven solutions. As a data scientist, you have to analyze the data provided to find which factors have a high influence on booking cancellations, build a predictive model that can predict in advance which bookings are going to be canceled, and help formulate profitable policies for cancellations and refunds.

Data Description

The data contains the different attributes of customers' booking details. The detailed data dictionary is given below.

Data Dictionary

Load the data and check the shape/info

Exploratory Data Analysis (EDA)

Check categorical value counts

Define function to plot histogram and boxplot together on the same scale
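The function for this step is not shown here, so the following is only a minimal sketch of one way to write it with matplotlib. The name `histogram_boxplot`, its parameters, and the sample data are assumptions made for illustration. The two panels share one x-axis, which is what keeps them on the same scale.

```python
import matplotlib.pyplot as plt
import numpy as np


def histogram_boxplot(values, bins=30, figsize=(10, 6)):
    """Draw a boxplot above a histogram, with both panels sharing one x-axis."""
    values = np.asarray(values, dtype=float)
    fig, (ax_box, ax_hist) = plt.subplots(
        nrows=2,
        sharex=True,  # shared x-axis: both panels use the same scale
        gridspec_kw={"height_ratios": (0.25, 0.75)},
        figsize=figsize,
    )
    ax_box.boxplot(values, vert=False, showmeans=True)
    ax_box.set_yticks([])
    ax_hist.hist(values, bins=bins)
    # Mark the mean and the median so skewness is easy to spot.
    ax_hist.axvline(values.mean(), color="green", linestyle="--", label="mean")
    ax_hist.axvline(np.median(values), color="black", linestyle="-", label="median")
    ax_hist.legend()
    return fig, ax_box, ax_hist


# Synthetic, right-skewed sample standing in for a column such as lead_time.
sample = np.random.default_rng(0).exponential(scale=80, size=500)
fig, ax_box, ax_hist = histogram_boxplot(sample)
```

In a notebook, the figure is displayed when the cell finishes; in a script, `plt.show()` or `fig.savefig(...)` would be needed.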

Define function for labeled barplot
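As above, the original function is not shown, so this is a hedged sketch of a labeled bar chart using pandas and matplotlib. The name `labeled_barplot`, the `as_percent` flag, and the sample values are assumptions for illustration (the "Corporate" category in particular is made up).

```python
import matplotlib.pyplot as plt
import pandas as pd


def labeled_barplot(series, as_percent=False):
    """Bar chart of category counts with a count or percentage above each bar."""
    counts = pd.Series(series).value_counts()
    total = counts.sum()
    fig, ax = plt.subplots(figsize=(len(counts) + 2, 5))
    bars = ax.bar([str(c) for c in counts.index], counts.values)
    for bar, count in zip(bars, counts.values):
        label = f"{100 * count / total:.1f}%" if as_percent else str(count)
        # Place the label centered just above the top of the bar.
        ax.annotate(
            label,
            (bar.get_x() + bar.get_width() / 2, bar.get_height()),
            ha="center",
            va="bottom",
        )
    ax.set_ylabel("count")
    return fig, ax


# Made-up category values, for illustration only.
segments = ["Online", "Offline", "Online", "Corporate", "Online"]
fig, ax = labeled_barplot(segments, as_percent=True)
bar_labels = [t.get_text() for t in ax.texts]
```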

Univariate Analysis

Bivariate Analysis

Model evaluation criterion

The model can make two kinds of wrong predictions:

  1. The model predicts a cancellation when the guest would not actually have canceled (false positive).
    • If the room is double-booked as a result, the guest may form a negative view of the hotel.
    • The hotel would need human resources to make alternate arrangements for guests, or would have to offer vouchers for a free stay or other perks in the future.
  2. The model predicts no cancellation when the guest would actually cancel (false negative).
    • The hotel loses revenue when it cannot resell the room.
    • The hotel may incur additional costs by increasing commissions or paying for marketing to help sell these rooms.
    • The hotel may need to lower prices to resell the room, which reduces the profit margin.
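To make the two error types concrete, the sketch below counts them on a small set of made-up labels (1 = canceled) and derives precision, recall, and F1 from those counts. The helper name `classification_summary` and the data are assumptions for illustration, not the notebook's own code.

```python
import numpy as np


def classification_summary(y_true, y_pred):
    """Count the four confusion-matrix cells and derive precision, recall, F1."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = int(np.sum(y_true & y_pred))    # predicted cancel, did cancel
    fp = int(np.sum(~y_true & y_pred))   # predicted cancel, did not cancel
    fn = int(np.sum(y_true & ~y_pred))   # predicted stay, did cancel
    tn = int(np.sum(~y_true & ~y_pred))  # predicted stay, did stay
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall, "f1": f1}


# Made-up labels: 1 = booking canceled, 0 = booking honored.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
summary = classification_summary(y_true, y_pred)
```

Precision falls as false positives grow and recall falls as false negatives grow, so a metric that combines the two, such as F1, penalizes both kinds of error.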

How to reduce this loss

Data Preparation

Logistic Regression (with Sklearn library)

Checking performance on training set

Checking performance on test set

Feature Transformation

Some features are very skewed and may behave better on a different scale. We can try to transform continuous variables, so we will check lead_time and avg_price_per_room.

We will choose the square root transformation because it makes the data slightly less skewed without our having to add a value to the data, and it may be easier to interpret than the other transformations.
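A minimal sketch of the transformation on synthetic data follows. The column names `lead_time` and `avg_price_per_room` come from the text above; the distributions, the `_sqrt` suffix, and the sample size are assumptions. Unlike a log transform, the square root is defined at zero, so no constant has to be added first.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed columns standing in for the real booking data.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "lead_time": rng.exponential(scale=80, size=5000),
    "avg_price_per_room": rng.gamma(shape=4, scale=25, size=5000),
})

columns = ["lead_time", "avg_price_per_room"]
for col in columns:
    # The square root is defined for zero, so no offset is needed.
    df[col + "_sqrt"] = np.sqrt(df[col])

skew_before = df[columns].skew()
skew_after = df[[c + "_sqrt" for c in columns]].skew()
print(pd.DataFrame({"before": skew_before.values, "after": skew_after.values},
                   index=columns))
```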

Logistic Regression (with Sklearn library)

Checking performance on training set

Checking performance on test set

Logistic Regression (with statsmodels library)

Observations

Note: Removing variables with high $p$-values has enabled optimization to terminate successfully. Previously, the maximum number of iterations was exceeded.

Now no feature has a $p$-value greater than 0.05, so we will treat the features in x_train2 as the final ones and lg3 as the final model.

Coefficient Interpretations

Converting coefficients to odds
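A logistic regression coefficient $\beta$ is a change in log-odds, so $e^{\beta}$ is the multiplicative change in the odds for a one-unit increase in the feature, and $(e^{\beta} - 1) \times 100$ is the percentage change. The sketch below applies this to made-up coefficients; the feature names and values are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up coefficients (log-odds per unit increase), for illustration only.
coefs = pd.Series({
    "feature_doubles_odds": np.log(2.0),
    "feature_no_effect": 0.0,
    "feature_halves_odds": np.log(0.5),
})

odds = np.exp(coefs)              # multiplicative change in odds
pct_change = (odds - 1.0) * 100   # percentage change in odds

print(pd.DataFrame({"odds": odds, "change_odds_pct": pct_change}))
```

A coefficient of $\ln 2$ therefore doubles the odds (+100%), a coefficient of 0 leaves them unchanged, and a coefficient of $\ln 0.5$ halves them (-50%).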

Coefficient interpretations

Checking model performance on the training set

ROC-AUC

Model Performance Improvement

Optimal threshold using AUC-ROC curve
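One common way to pick a threshold from the ROC curve, sketched here on synthetic scores, is to maximize the difference between the true positive rate and the false positive rate. The data and variable names are assumptions, and the notebook's own selection rule may differ.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Synthetic labels and predicted probabilities, for illustration only.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = 2.0 * y_true - 1.0 + rng.normal(0.0, 1.0, size=1000)
y_prob = 1.0 / (1.0 + np.exp(-scores))

fpr, tpr, thresholds = roc_curve(y_true, y_prob)

# Pick the threshold where TPR exceeds FPR by the largest margin.
j = tpr - fpr
best = int(np.argmax(j))
optimal_threshold = float(thresholds[best])
print(f"threshold={optimal_threshold:.3f}, tpr={tpr[best]:.3f}, fpr={fpr[best]:.3f}")
```

Predictions are then made with `y_prob > optimal_threshold` instead of the default cutoff of 0.5.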

Checking model performance on training set

Let's use the precision-recall curve and see whether it gives a better threshold
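A minimal sketch of reading a threshold off the precision-recall curve follows. It assumes the goal is the point where precision and recall are closest to each other, which is one possible choice and not necessarily the notebook's. The data is synthetic.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Synthetic labels and predicted probabilities, for illustration only.
rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)
scores = 2.0 * y_true - 1.0 + rng.normal(0.0, 1.0, size=1000)
y_prob = 1.0 / (1.0 + np.exp(-scores))

precision, recall, thresholds = precision_recall_curve(y_true, y_prob)

# precision and recall have one more entry than thresholds; the final
# (precision=1, recall=0) point has no threshold, so it is excluded.
gap = np.abs(precision[:-1] - recall[:-1])
best = int(np.argmin(gap))
balanced_threshold = float(thresholds[best])
print(f"threshold={balanced_threshold:.3f}, "
      f"precision={precision[best]:.3f}, recall={recall[best]:.3f}")
```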

Checking model performance on training set

Model Performance Summary

Let's check performance on the test set

Dropping the columns from the test set that were dropped from the training set

Using model with default threshold

Using model with threshold = 0.32

Using model with threshold = 0.43

Model performance summary

Conclusion

Recommendations

Split Data for Decision Tree

Build Decision Tree Model

Checking model performance on the test set

Visualizing the Decision Tree

Reducing overfitting

Visualizing the Decision Tree

Observations from the tree:

Using the decision rules extracted above, we can interpret the decision tree model. For example:

* For a booking with a lead time of no more than 86.5 days and 0 special requests, made by a non-repeat guest from neither the Online nor the Offline market segment, the model predicts that the booking will not be canceled if the average price per room is no more than 59.50 euros, and that it will be canceled if the average price per room is greater than 59.50 euros.
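The rule above can be written out as a small function, which makes the nesting of conditions explicit. This is only an illustration of that single path through the tree: the function name and parameter names are hypothetical, and bookings that do not match the path return `None` because other branches are not covered here.

```python
def predict_from_rule(lead_time, special_requests, repeated_guest,
                      market_segment, avg_price_per_room):
    """Apply the single decision path described above; None if it does not apply."""
    on_path = (
        lead_time <= 86.5
        and special_requests == 0
        and not repeated_guest
        and market_segment not in ("Online", "Offline")
    )
    if not on_path:
        return None  # another branch of the tree decides this booking
    # The last split on the path is the average room price in euros.
    return "canceled" if avg_price_per_room > 59.50 else "not canceled"


# Made-up bookings; "Corporate" is a hypothetical segment name.
cheap = predict_from_rule(30, 0, False, "Corporate", 55.0)
pricey = predict_from_rule(30, 0, False, "Corporate", 90.0)
other = predict_from_rule(120, 0, False, "Corporate", 55.0)
```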

Interpretations from other decision rules can be made in a similar manner.

Cost Complexity Pruning

Next, we train a decision tree for each of the effective alphas. The last value in ccp_alphas is the alpha value that prunes the whole tree, leaving the last tree, clfs[-1], with only one node.

For the remainder, we remove the last element in clfs and ccp_alphas, because it is the trivial tree with only one node. Here we show that the number of nodes and the tree depth decrease as alpha increases.
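The steps above can be sketched with scikit-learn's `cost_complexity_pruning_path`. The names `clfs` and `ccp_alphas` come from the text; the synthetic dataset and the `random_state` values are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the prepared training set.
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           random_state=1)

# Effective alphas at which successive subtrees get pruned.
path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
# Guard against tiny negative values caused by floating-point error.
ccp_alphas = np.maximum(path.ccp_alphas, 0.0)

# One tree per alpha; larger alphas prune more aggressively.
clfs = [DecisionTreeClassifier(random_state=1, ccp_alpha=alpha).fit(X, y)
        for alpha in ccp_alphas]
last_node_count = clfs[-1].tree_.node_count

# Drop the trivial single-node tree before comparing the rest.
clfs, ccp_alphas = clfs[:-1], ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depths = [clf.tree_.max_depth for clf in clfs]
print(f"nodes: {node_counts[0]} -> {node_counts[-1]}, "
      f"depth: {depths[0]} -> {depths[-1]}")
```

Plotting `node_counts` and `depths` against `ccp_alphas` shows the trend directly.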

Checking performance on training set

Checking performance on test set

Visualizing the Decision Tree

This model is still quite complicated and not very easy to interpret.

Creating model with 0.005 ccp_alpha

Comparing all decision tree models

Conclusions

Recommendations